Self Supervision Does Not Help Natural Language Supervision at Scale
Self-supervision and natural language supervision have emerged as two
exciting ways to train general-purpose image encoders that excel at a variety
of downstream tasks. Recent works such as M3AE and SLIP have suggested that
these approaches can be effectively combined, but, notably, their results
use small pre-training datasets (<50M samples) and do not reflect the
large-scale regime (>100M examples) that is commonly used for these
approaches. Here we investigate whether a similar approach can be effective
when trained with a much larger amount of data. We find that a combination of
two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive
language-image pre-training (CLIP), provides a benefit over CLIP alone when
trained on a corpus of 11.3M image-text pairs, but little to no benefit (as
evaluated on a suite of common vision tasks) over CLIP when trained on a large
corpus of 1.4B images. Our work provides some much-needed clarity into the
effectiveness (or lack thereof) of self-supervision for large-scale image-text
training.
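
A minimal sketch of how such a combined objective might look: a CLIP-style
contrastive loss on paired image-text embeddings plus an MAE-style masked
pixel reconstruction loss. The encoder/decoder interfaces, the temperature,
and the `mae_weight` mixing coefficient are illustrative assumptions, not the
paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_mae_loss(image_encoder, text_encoder, mae_decoder,
                  images, texts, mask_ratio=0.75, mae_weight=0.5):
    """Joint objective: CLIP contrastive loss + MAE reconstruction loss.

    `image_encoder`, `text_encoder`, `mae_decoder`, and `mae_weight` are
    hypothetical names; the abstract does not specify this interface.
    """
    # Contrastive (CLIP) branch: align image and text embeddings.
    img_emb = F.normalize(image_encoder(images), dim=-1)        # (B, D)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)          # (B, D)
    logits = img_emb @ txt_emb.t() / 0.07                       # assumed temperature
    targets = torch.arange(images.shape[0], device=images.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # Reconstruction (MAE) branch: predict pixels of masked patches.
    # Assumes the decoder returns (predictions, binary mask, target patches).
    pred, mask, patches = mae_decoder(images, mask_ratio=mask_ratio)
    recon = ((pred - patches) ** 2).mean(dim=-1)                # per-patch MSE
    recon = (recon * mask).sum() / mask.sum()                   # masked patches only

    return contrastive + mae_weight * recon
```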
On Robustness in Multimodal Learning
Multimodal learning is defined as learning over multiple heterogeneous input
modalities such as video, audio, and text. In this work, we are concerned with
understanding how models behave when the available modalities differ between
training and deployment, a situation that naturally arises in many applications
of multimodal learning to hardware platforms. We present a multimodal
robustness framework to provide a systematic analysis of common multimodal
representation learning methods. Further, we identify robustness shortcomings
of these approaches and propose two intervention techniques leading to
robustness improvements on three datasets: AudioSet, Kinetics-400, and
ImageNet-Captions. Finally, we demonstrate that these interventions better
utilize additional modalities, if present, to achieve competitive mAP results
on AudioSet 20K.
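
To make the train/deployment mismatch concrete, here is a sketch of one
generic intervention for this failure mode: randomly dropping modalities
during training so the fused representation tolerates missing inputs at test
time. Modality dropout is a common technique and an assumption here, not
necessarily one of the paper's two interventions; all names are hypothetical.

```python
import torch

def fuse_with_modality_dropout(features: dict, drop_prob: float = 0.3,
                               training: bool = True) -> torch.Tensor:
    """Average-fuse modality embeddings, randomly dropping modalities
    during training so the model learns to handle absent modalities.

    `features` maps modality name -> (B, D) embedding tensor.
    """
    names = list(features)
    kept = names
    if training:
        kept = [n for n in names if torch.rand(()) > drop_prob]
        if not kept:  # always keep at least one modality
            kept = [names[torch.randint(len(names), ()).item()]]
    stacked = torch.stack([features[n] for n in kept])  # (M, B, D)
    return stacked.mean(dim=0)                          # (B, D) fused embedding

# At deployment, simply pass whichever modalities are available:
# fused = fuse_with_modality_dropout({"audio": a_emb}, training=False)
```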
TiC-CLIP: Continual Training of CLIP Models
Keeping large foundation models up to date on the latest data is inherently
expensive. To avoid the prohibitive costs of constantly retraining, it is
imperative to continually train these models. This problem is exacerbated by
the lack of any large-scale continual learning benchmarks or baselines. We
introduce the first set of web-scale Time-Continual (TiC) benchmarks for
training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps, with
over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first
use our benchmarks to curate various dynamic evaluations to measure the
temporal robustness of existing models. We show that OpenAI's CLIP (trained on
data up to 2020) loses zero-shot accuracy on our curated retrieval task from
2021--2022 compared with more recently trained models in the OpenCLIP
repository. We then study how to efficiently train models on time-continuous
data. We demonstrate that a simple rehearsal-based approach that continues
training from the last checkpoint and replays old data reduces compute
compared to the standard practice of retraining from scratch.
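
A minimal sketch of the rehearsal idea described above: resume from the last
checkpoint and build each batch from a mix of new and replayed old samples.
The helper name, the replay fraction, and the checkpoint path are illustrative
assumptions; the benchmark's actual loaders and schedules are not shown here.

```python
import random

def rehearsal_batches(new_data, old_data, replay_frac=0.5, batch_size=256):
    """Yield batches mixing fresh samples with replayed old samples.

    Hypothetical helper. Assumes `old_data` holds at least
    `batch_size * replay_frac` samples to draw from.
    """
    n_old = int(batch_size * replay_frac)
    n_new = batch_size - n_old
    random.shuffle(new_data)
    for i in range(0, len(new_data) - n_new + 1, n_new):
        batch = new_data[i:i + n_new] + random.sample(old_data, n_old)
        random.shuffle(batch)  # interleave old and new within the batch
        yield batch

# Continual step (illustrative): resume from the last checkpoint instead of
# retraining from scratch, then train on the mixed batches.
# model.load_state_dict(torch.load("checkpoint_prev_year.pt"))
# for batch in rehearsal_batches(data_new_year, data_prev_years):
#     train_step(model, batch)
```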